* San Francisco skyline Marco Brivio | Getty Images

Introduction to the project and data

The dataset used in the project is based on data collected from the 1990 California Census and was initially featured in the following paper:

Pace, R. Kelley, and Ronald Barry. “Sparse spatial autoregressions.” Statistics & Probability Letters 33.3 (1997): 291-297.

Each observation in the dataset includes information on blocks (different neighborhoods) in California in 1990. Although we may not directly apply the results in this project to California housing prices in the present, we will implement machine learning techniques to predict the median_house_value in California during that time period.

Variables in the housing dataset:

Data import and cleaning

# importing the data
housing <- read_csv(file="housing.csv", col_names=TRUE);

# number of observations in data frame
nrow1 <- nrow(housing);

# remove missing values
housing <- na.omit(housing);

# number of observations in data frame after removing NAs
nrow2 <- nrow(housing);

# make ocean_proximity lower case
housing$ocean_proximity <- tolower(housing$ocean_proximity);

# changing units of tens of thousand dollars to dollars
housing$median_income <- 10000 * housing$median_income;

kbl(head(housing, 25), caption="California housing dataset (first 25 observations)", align="c") %>%
  kable_paper() %>%
  scroll_box(width="100%", height="500px");
California housing dataset (first 25 observations)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
-122.23 37.88 41 880 129 322 126 83252 452600 near bay
-122.22 37.86 21 7099 1106 2401 1138 83014 358500 near bay
-122.24 37.85 52 1467 190 496 177 72574 352100 near bay
-122.25 37.85 52 1274 235 558 219 56431 341300 near bay
-122.25 37.85 52 1627 280 565 259 38462 342200 near bay
-122.25 37.85 52 919 213 413 193 40368 269700 near bay
-122.25 37.84 52 2535 489 1094 514 36591 299200 near bay
-122.25 37.84 52 3104 687 1157 647 31200 241400 near bay
-122.26 37.84 42 2555 665 1206 595 20804 226700 near bay
-122.25 37.84 52 3549 707 1551 714 36912 261100 near bay
-122.26 37.85 52 2202 434 910 402 32031 281500 near bay
-122.26 37.85 52 3503 752 1504 734 32705 241800 near bay
-122.26 37.85 52 2491 474 1098 468 30750 213500 near bay
-122.26 37.84 52 696 191 345 174 26736 191300 near bay
-122.26 37.85 52 2643 626 1212 620 19167 159200 near bay
-122.26 37.85 50 1120 283 697 264 21250 140000 near bay
-122.27 37.85 52 1966 347 793 331 27750 152500 near bay
-122.27 37.85 52 1228 293 648 303 21202 155500 near bay
-122.26 37.84 50 2239 455 990 419 19911 158700 near bay
-122.27 37.84 52 1503 298 690 275 26033 162900 near bay
-122.27 37.85 40 751 184 409 166 13578 147500 near bay
-122.27 37.85 42 1639 367 929 366 17135 159800 near bay
-122.27 37.84 52 2436 541 1015 478 17250 113900 near bay
-122.27 37.84 52 1688 337 853 325 21806 99700 near bay
-122.27 37.84 52 2224 437 1006 422 26000 132600 near bay

207 observations were removed because there were missing values in the total_bedrooms column. The dataset now contains 20433 observations.

Exploratory, descriptive statistics and graphics

# geographic data on CA
ca <- map_data("state") %>%
  subset(region == "california");

# plotting points from CA housing dataset 
ggplotly(
  ggplot() +
    geom_polygon(data=ca, aes(x = long, y = lat), fill = "light grey", color = "black") +
    geom_point(data=housing, aes(longitude, latitude, col=median_house_value, label=population, label2=ocean_proximity, 
      label3=median_income), size=0.75) +
    theme_bw() +
    labs(x="Longitude", y="Latitude", title="California blocks geographic location"));

From the visualization above, we deduce that many blocks in the Bay Area and Southern California Coast have a higher median_house_value compared to blocks in the rest of the state.

# data frame with total counts of each ocean_proximity observation
ocean_dist <- housing %>%
  select(ocean_proximity) %>%
  group_by(ocean_proximity) %>%
  summarise(total=length(ocean_proximity));

# order columns based on counts
ocean_dist$ocean_proximity <- reorder(ocean_dist$ocean_proximity, -ocean_dist$total)

# bar chart
ggplotly(
  ggplot(ocean_dist, aes(ocean_proximity, total)) + 
    geom_bar(stat='identity', fill="dodgerblue") +
    labs(x="Ocean proximity", y="Total block groups", 
         title="Barchart of total block groups by ocean proximity in California"));

The bar chart above depicts the distribution of total number of blocks in California by ocean_proximity. The majority of CA blocks reside less than 1 hour from the ocean (9034 blocks) and inland (6496 blocks). The least amount of blocks in CA are on an island with 5 blocks.

# box plot median house value vs ocean proximity
options(scipen=10000);
ggplotly(
  ggplot(housing) +
    geom_boxplot(aes(ocean_proximity, median_house_value), fill="dodgerblue") + 
    labs(x="Ocean proximity", y="Median house value", 
         title="Boxplots of median house value by ocean proximity in California"));

The boxplots above depicts the median_house_value by ocean_proximity. From the chart, we see that blocks less than one hour from the ocean, near bay, and near ocean have median median_house_price at around $220,000. Island blocks have the largest median median_house_value at almost $415,000. Inland blocks have the lowest median median_house_value at just under $109,000. Blocks < 1h ocean, near bay, and near ocean have similar ranges of median_house_values. The majority of islands blocks have high median_house_value compared to other regions. The majority of inland blocks are cheap compared to the rest, but there are outliers with prices that rival the other regions.

# summary statistics
housing.mean <- sapply(housing[ , 3:9], mean);
housing.min <- sapply(housing[ , 3:9], min);
housing.max <- sapply(housing[ , 3:9], max);
housing.sd <- sapply(housing[ , 3:9], sd);
housing.median <- sapply(housing[ , 3:9], median);

housing.summary <- rbind(average=housing.mean, minimum=housing.min, maximum=housing.max, "standard deviation"=housing.sd, median=housing.median)

housing.summary <- round(housing.summary, 2);

kable(housing.summary, caption="Summary statistics for quantitative variables", align="c");
Summary statistics for quantitative variables
housing_median_age total_rooms total_bedrooms population households median_income median_house_value
average 28.63 2636.50 537.87 1424.95 499.43 38711.62 206864.4
minimum 1.00 2.00 1.00 3.00 1.00 4999.00 14999.0
maximum 52.00 39320.00 6445.00 35682.00 6082.00 150001.00 500001.0
standard deviation 12.59 2185.27 421.39 1133.21 382.30 18992.91 115435.7
median 29.00 2127.00 435.00 1166.00 409.00 35365.00 179700.0
# average for each variable by ocean_proximity
housing.mean.OP <- housing[ , 3:10] %>%
  group_by(ocean_proximity) %>%
  summarise_all(mean, na.rm=T);

# minimum for each variable by ocean_proximity
housing.min.OP <- housing[ , 3:10] %>%
  group_by(ocean_proximity) %>%
  summarise_all(min, na.rm=T);

# maximum for each variable by ocean_proximity
housing.max.OP <- housing[ , 3:10] %>%
  group_by(ocean_proximity) %>%
  summarise_all(max, na.rm=T);

# standard deviation for each variable by ocean_proximity
housing.sd.OP <- housing[ , 3:10] %>%
  group_by(ocean_proximity) %>%
  summarise_all(sd, na.rm=T);

# median for each variable by ocean_proximity
housing.median.OP <- housing[ , 3:10] %>%
  group_by(ocean_proximity) %>%
  summarise_all(median, na.rm=T);


housing.summary.op <- rbind(housing.mean.OP, housing.min.OP, housing.max.OP,
      housing.sd.OP, housing.median.OP);

# column that indicates the summary statistic applied
housing.summary.op$summary_stat <- c(rep("mean", 5), rep("min", 5), rep("max", 5),
                                     rep("sd", 5), rep("median", 5))

housing.summary.op <- housing.summary.op %>%
  relocate(summary_stat, .after=ocean_proximity) %>%
  arrange(ocean_proximity);

housing.summary.op[ , 3:9] <- round(housing.summary.op[ , 3:9], 2);

kable(housing.summary.op, align="c", caption="Summary statistics for variables by ocean proximity");
Summary statistics for variables by ocean proximity
ocean_proximity summary_stat housing_median_age total_rooms total_bedrooms population households median_income median_house_value
<1h ocean mean 29.28 2627.23 546.54 1518.44 517.42 42311.01 240267.99
<1h ocean min 2.00 11.00 5.00 3.00 4.00 4999.00 17500.00
<1h ocean max 52.00 37937.00 6445.00 35682.00 6082.00 150001.00 500001.00
<1h ocean sd 11.64 2164.15 427.91 1186.98 392.78 19983.00 106198.32
<1h ocean median 30.00 2107.00 438.00 1246.00 420.00 38790.00 215000.00
inland mean 24.26 2721.25 533.88 1392.41 478.01 32103.59 124896.86
inland min 1.00 2.00 2.00 5.00 2.00 4999.00 14999.00
inland max 52.00 39320.00 6210.00 16305.00 5358.00 150001.00 500001.00
inland sd 12.03 2390.70 446.12 1171.24 393.07 14371.15 70057.96
inland median 23.00 2136.00 423.00 1124.50 385.00 29898.00 108700.00
island mean 42.40 1574.60 420.40 668.00 276.60 27444.20 380440.00
island min 27.00 716.00 214.00 341.00 160.00 21579.00 287500.00
island max 52.00 2359.00 591.00 1100.00 431.00 33906.00 450000.00
island sd 13.16 707.55 169.32 301.69 113.20 4441.80 80559.56
island median 52.00 1675.00 512.00 733.00 288.00 27361.00 414700.00
near bay mean 37.76 2490.34 514.18 1227.88 487.24 41756.47 259279.29
near bay min 2.00 8.00 1.00 8.00 1.00 4999.00 22500.00
near bay max 52.00 18634.00 3226.00 8276.00 3052.00 150001.00 500001.00
near bay sd 13.07 1823.58 367.89 876.10 344.70 20211.20 122853.74
near bay median 39.00 2082.50 423.00 1033.50 404.50 38186.50 233800.00
near ocean mean 29.31 2587.17 538.62 1355.64 501.53 40063.74 249042.36
near ocean min 2.00 15.00 3.00 8.00 3.00 5360.00 22500.00
near ocean max 52.00 30405.00 4585.00 12873.00 4176.00 150001.00 500001.00
near ocean sd 11.84 1998.04 376.32 1008.13 345.14 20161.57 122548.01
near ocean median 29.00 2197.00 464.00 1137.50 429.00 36483.00 228750.00

Machine Learning Models

In our machine learning models, only the quantitative variables will be used. That is, all the variables excluding longitude, latitude, and ocean_proximity.

# splitting data into training and testing set
training.index <- sample(1:nrow(housing), nrow(housing)/2);

training <- housing[training.index, ] %>% select(!c(longitude, latitude, ocean_proximity));
kable(head(training, 10), caption="Training set (first 10 observations")
Training set (first 10 observations
housing_median_age total_rooms total_bedrooms population households median_income median_house_value
2 838 295 240 149 28750 237500
25 1709 439 632 292 17868 45500
15 2150 327 1094 324 60224 198500
22 3479 455 1454 488 66324 347600
4 4791 695 1871 659 69532 277000
28 859 199 455 211 23293 215900
16 2867 559 1203 449 27143 95300
19 2164 410 1309 426 33380 185300
25 3445 898 5558 894 30972 169300
41 2475 532 1416 470 38372 156400
testing <- housing[-training.index, ] %>% select(!c(longitude, latitude, ocean_proximity));
kable(head(testing, 10), caption="Testing set (first 10 observations)")
Testing set (first 10 observations)
housing_median_age total_rooms total_bedrooms population households median_income median_house_value
41 880 129 322 126 83252 452600
21 7099 1106 2401 1138 83014 358500
52 1467 190 496 177 72574 352100
42 2555 665 1206 595 20804 226700
52 3549 707 1551 714 36912 261100
52 2491 474 1098 468 30750 213500
52 696 191 345 174 26736 191300
52 1966 347 793 331 27750 152500
52 1228 293 648 303 21202 155500
50 2239 455 990 419 19911 158700

Regularization - Ridge Regression

training.x = model.matrix(median_house_value~. , training)[ , -1];
training.y = training$median_house_value;

testing.x = model.matrix(median_house_value~. , testing)[ , -1];
testing.y = testing$median_house_value;


set.seed(10);
# cross validations to choose lambda
cv.errors <- cv.glmnet(training.x, training.y, alpha=0);
plot(cv.errors);

By cross validation, the best lambda for ridge regression is \(\lambda\) = 7987.5455678.

ridge.model <- glmnet(training.x, training.y, alpha=0, lambda = cv.errors$lambda.min);

# coefficients of ridge regression model on testing set
coef(ridge.model);
## 7 x 1 sparse Matrix of class "dgCMatrix"
##                               s0
## (Intercept)        -14991.261587
## housing_median_age   1739.053791
## total_rooms            -5.017248
## total_bedrooms         54.769915
## population            -27.278244
## households             69.246459
## median_income           4.121757
ridge.predict <- predict(ridge.model, s=cv.errors$lambda.min, newx=testing.x);

ridge.test.error <- mean((ridge.predict - testing.y)^2);

Our estimated ridge regression model is:

median_house_value = -14991.26 + 1739.05housing_median_age + -5.02total_rooms + 54.77total_bedrooms + -27.28population + 69.25households + 4.12median_income

with a test error of MSE = 6059454667.61883.

Generalized Additive Models

# applied to the training set
gam1 = gam(median_house_value ~ s(housing_median_age) + 
             s(total_rooms) + s(total_bedrooms) + s(population) + 
             s(households) + s(median_income), data=training);
par(mfrow=c(2, 3));
plot(gam1, se=TRUE ,col =" blue");

# applied to the testing set
gam2 = gam(median_house_value ~ s(housing_median_age) + 
             s(total_rooms) + s(total_bedrooms) + s(population) + 
             s(households) + s(median_income), data=testing);

# summary of gam model on test data
summary(gam2);
## 
## Call: gam(formula = median_house_value ~ s(housing_median_age) + s(total_rooms) + 
##     s(total_bedrooms) + s(population) + s(households) + s(median_income), 
##     data = testing)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -463720  -45086  -10384   31970  543673 
## 
## (Dispersion Parameter for gaussian family taken to be 5098901994)
## 
##     Null Deviance: 136757485535296 on 10216 degrees of freedom
## Residual Deviance: 51968007544944 on 10192 degrees of freedom
## AIC: 257394.9 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                          Df         Sum Sq        Mean Sq  F value
## s(housing_median_age)     1  1802251875454  1802251875454   353.46
## s(total_rooms)            1  5211746561959  5211746561959  1022.13
## s(total_bedrooms)         1  3876069017624  3876069017624   760.18
## s(population)             1  6908955381469  6908955381469  1354.99
## s(households)             1  4271217680826  4271217680826   837.67
## s(median_income)          1 64968481315969 64968481315969 12741.66
## Residuals             10192 51968007544944     5098901994         
##                                      Pr(>F)    
## s(housing_median_age) < 0.00000000000000022 ***
## s(total_rooms)        < 0.00000000000000022 ***
## s(total_bedrooms)     < 0.00000000000000022 ***
## s(population)         < 0.00000000000000022 ***
## s(households)         < 0.00000000000000022 ***
## s(median_income)      < 0.00000000000000022 ***
## Residuals                                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                       Npar Df Npar F                 Pr(F)    
## (Intercept)                                                   
## s(housing_median_age)       3  18.44     0.000000000006345 ***
## s(total_rooms)              3 310.75 < 0.00000000000000022 ***
## s(total_bedrooms)           3 378.18 < 0.00000000000000022 ***
## s(population)               3 223.76 < 0.00000000000000022 ***
## s(households)               3 151.95 < 0.00000000000000022 ***
## s(median_income)            3 227.99 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2, 3));
plot(gam2, se=TRUE ,col =" blue");

par(mfrow=c(1, 1));

pred <- predict(gam2, testing);

# test errors for the predictor variables
housing_median_age_MSE <- mean((testing$housing_median_age-pred)^2);
total_rooms_MSE <- mean((testing$total_rooms-pred)^2);
total_bedrooms_MSE <- mean((testing$total_bedrooms-pred)^2);
population_MSE <- mean((testing$population-pred)^2);
households_MSE <- mean((testing$households-pred)^2);
median_income_MSE <- mean((testing$median_income-pred)^2);

# Test Errors
gam.test.error <- as.data.frame(rbind(housing_median_age_MSE, total_rooms_MSE, total_bedrooms_MSE,
           population_MSE, households_MSE, median_income_MSE));
row.names(gam.test.error) <- names(housing[3:8]);
names(gam.test.error) <- "MSE";

Our estimated generalized additive model is:

median_house_value = -67960.6 + 1944.04s(housing_median_age) + -29.31s(total_rooms) + 150.42s(total_bedrooms) + -35.78s(population) + 129.44s(households) + 5.22s(median_income)

with a test errors for each variable of
Test errors for each predictor variable in GAM model
MSE
housing_median_age 51089443027
total_rooms 49934943084
total_bedrooms 50869865093
population 50511724563
households 50885338294
median_income 33914834260

Polynomial Regression

# matrix plot for housing dataset
ggpairs(housing[ , 3:9]);

Based on the matrix plot above, the variable with the highest correlation with median_house_value is median_income with a correlation coefficient of 0.6883555.

# scatterplot median_house_value vs median_income
ggplotly(ggplot(housing) + 
  geom_point(aes(median_income, median_house_value), size=1, col="dodgerblue") + 
  labs(x="Median income", y="Median house value", title="Scatterplot of median house value vs median income"));

From the scatterplot above, we can clearly see that there is a positive association between median_income and median_house_value.

# cross validation to choose optimal polynomial degree
cv.error=rep(0,10);
for (i in 1:10) {
  glm.fit = glm(median_house_value~poly(median_income, i), data = housing[sample(1:nrow(housing), 5000), ]);
  cv.error[i] = cv.glm(housing[sample(1:nrow(housing), 5000), ], glm.fit)$delta[1];
}

# plot of test errors by degree
ggplotly(  
  ggplot() +
    geom_point(aes(1:length(cv.error), cv.error), col="blue", size=2) +
    geom_line(aes(1:length(cv.error), cv.error), size=0.5) +
    labs(x="Polynomial degree", y="Test error (MSE)", title="Test error (MSE) by polynomial degree") +
    theme_bw() + 
    scale_x_continuous(breaks=1:10));
data.frame(poly.degree=1:length(cv.error), MSE=cv.error);
##    poly.degree         MSE
## 1            1 19522304422
## 2            2 19423797649
## 3            3 19430019567
## 4            4 20399643152
## 5            5 19236628607
## 6            6 20195506476
## 7            7 19334874831
## 8            8 19962243574
## 9            9 19680937228
## 10          10 19395387618
optimal.degree <- which(cv.error==min(cv.error));

# polynomial regression model
median_incomelims = range(housing$median_income);
median_income.grid=seq(from=median_incomelims[1], to=median_incomelims[2]);
poly.fit <- lm(median_house_value ~  poly(median_income, optimal.degree), data=housing);
pred = predict(poly.fit, newdata =list(median_income=median_income.grid),se=T);
se.band = cbind(pred$fit +2*pred$se.fit, pred$fit -2* pred$se.fit);

# summary of the polynomial regression model
summary(poly.fit);
## 
## Call:
## lm(formula = median_house_value ~ poly(median_income, optimal.degree), 
##     data = housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -392860  -55972  -16451   35739  413140 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                            206864.4      578.6 357.513
## poly(median_income, optimal.degree)1 11358166.4    82710.4 137.325
## poly(median_income, optimal.degree)2 -1110193.0    82710.4 -13.423
## poly(median_income, optimal.degree)3 -1498110.7    82710.4 -18.113
## poly(median_income, optimal.degree)4   165372.8    82710.4   1.999
## poly(median_income, optimal.degree)5   105491.6    82710.4   1.275
##                                                 Pr(>|t|)    
## (Intercept)                          <0.0000000000000002 ***
## poly(median_income, optimal.degree)1 <0.0000000000000002 ***
## poly(median_income, optimal.degree)2 <0.0000000000000002 ***
## poly(median_income, optimal.degree)3 <0.0000000000000002 ***
## poly(median_income, optimal.degree)4              0.0456 *  
## poly(median_income, optimal.degree)5              0.2022    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82710 on 20427 degrees of freedom
## Multiple R-squared:  0.4867, Adjusted R-squared:  0.4866 
## F-statistic:  3874 on 5 and 20427 DF,  p-value: < 0.00000000000000022
# scatter plot with optimal polynomial fit
ggplotly(ggplot(housing, aes(median_income, median_house_value)) +
  geom_point(col="dark grey") +
  geom_smooth(method = lm, formula = y ~ poly(x, which(cv.error==min(cv.error)), raw = TRUE), se=F, col="blue") + 
  labs(x="Median income", y="Median house value", title="Median house value vs median income with polynomial of optimal degree"));

The optimal degree for polynomial regression is 5.

Our estimated polynomial regression model is:

median_house_value = 206864.41 + 11358166.43median_income + -1110192.98median_income^2 + -1498110.74median_income^3 + 165372.78median_income^4 + 105491.6median_income^5

with a test error of MSE = 19236628606.76

Performance on test data compared across models

As we saw previously,

Conclusion

Based on the test errors for each Machine Learning method, we can conclude that the ridge regression model is best (compared to the other 2 methods) in predicting the median_house_value for blocks in California around the year 1990 since this method has the smallest test error MSE = 6059454667.62.

The ridge regression model pertaining to our data is:

median_house_value = -14991.26 + 1739.05housing_median_age + -5.02total_rooms + 54.77total_bedrooms + -27.28population + 69.25households + 4.12median_income